Communication server, method and systems, for reducing transportation volumes over communication networks

ABSTRACT

A method for delivering data streams over communication networks is disclosed, the method comprising determining reference points in a stream of data, being locations in the stream where a predefined number of characters fulfill a predetermined criterion; registering digital signatures, being values returned from a predetermined function taken over predefined ranges of content, the ranges being in correlation with the reference points; using the digital signatures to locate locally stored content; and using the reference points, or creating a dictionary and using it, to synchronize between currently received pieces of data and locally stored matching content. A communication server implementing the method is also disclosed, and further disclosed are communication systems comprising at least one said server.

FIELD OF THE INVENTION

This invention relates to the field of data transportation via communication networks, and more specifically to the field of reducing the volume of such data transportation.

BACKGROUND OF THE INVENTION

The Background chapter of U.S. provisional Patent Application No. 60/548,855 is herewith incorporated by reference.

Reducing bandwidth is a major desire of ISPs (Internet service providers), home users, content providers and almost every organization that owns a network. Bandwidth means cost, since communication lines are actually leased according to the amount of data they transfer. To a leaser, less bandwidth consumed means less money spent on renting a communication line. To a bandwidth provider, less data volume transmitted over a line by a given number of clients means additional clients that may be subscribed without reducing service quality.

U.S. publication No. 2002/0184333 (hereinafter D1) aims to improve the performance of communication networks. The problem addressed by D1 arises where a plurality of requestors may send requests for the same file, wherein each request is directed to a different provider, quoting a different URL or even a different file name. In conventional caching such plural requests will be treated as requests for different files, resulting in respective plural file deliveries.

According to the D1 invention, after a complete file is received from each of the plural sources, it is assigned a digital signature, which in turn is recognized as being identical to the digital signature of the same file received from any one of the other sources. Accordingly, further requests for the file from any one of the plural providers are satisfied by retrieving the file from a single copy stored locally in the cache server. The improvement lies in caching frequently requested files and returning such files to a requestor in response to the requestor's inquiry, wherein the files are indexed according to a digital signature assigned to each file, thus avoiding multiple caching and multiple provider-server deliveries of the same file in cases where one file is retrievable from a plurality of sources.

As can be appreciated, in order to be able to associate different request forms with one requested file, D1 requires that a cached file be recognized twice: first according to its digital signature, and second according to all URLs or file names known to be associated with that file. If a request for the same file is received in a form unfamiliar to the system, it can be satisfied only by retrieving the file from the location indicated by the request, and only after the file has once been retrieved and recognized as a copy of a locally stored file can the file name or URL indicated by the request be associated with that local file for future use (i.e. for future similar requests). Another drawback of D1 is that the files stored in some provider computers can change, while the intermediate node continues providing the requestors with an old version of a locally stored file, without being aware of the change.

WO 95/19003 publication (hereinafter D2) relates to cases where a receiving computer has an old file and a transmitting computer has a new file, wherein portions of both files may be identical. In order to allow the receiving computer to update the old file to be identical to the new file, at least those portions of the new file that are not identical to any portion of the old file must be transmitted to the receiving computer. In order to allow both the receiving computer and the transmitting computer to determine which portions of the file need not be transmitted (thus speeding the transmission process), the receiving computer needs, as a preliminary step, to divide the old file into segments, to calculate a hash number for each segment and to send the hash numbers to the transmitting computer. The transmitting computer then has to calculate hash numbers “to each possible segment in the new file” (page 7 lines 23-24), and then to compare these hash numbers to the hash numbers that it has received from the receiving computer.

As can be appreciated, D2 aims to reconstruct a file on a receiving computer from segments of a new file received from a sender and from existing segments of a pre-designated old file existing on the receiving computer, such that the result will be a copy of the new file on the receiving computer. U.S. Pat. No. 5,721,907 (hereinafter D3) has an aim similar to that of D2. D1 to D3 all relate to a receiving computer which expects a predetermined requested file (or part of a file) to be obtained from another computer and stored on the receiving computer. When a file is known, it also has known starting and ending points that can be used as reference locations for a digital signature to be computed over a known range of data.

In contrast to the known art, the present invention aims at initiating transportation reduction in streams of data in real time, i.e. reducing the volume of data streams whose content is anonymous and cannot yet be recognized as a file or a part of a file by the computer initiating the reduction. Referring for example to live video broadcasting, wherein a plurality of receiving computers receive the same content in real time, the system and method disclosed by D1 are incapable of reducing the volume of transportation in the network, since the intermediate node has no file in its cache to be retrieved: no actual “file” exists anywhere in such a case, where data is created in real time.

The D2 and D3 inventions are also irrelevant to such a case, since there is no “old file” in the receiving computers whose portions may be identical to the expected data.

In the context of the present invention, the term “anchor” means a location in a stream of data determined according to content thereof fulfilling a predetermined criterion. The criterion is set such that a satisfactory probability for the presence of an anchor over a predetermined amount of data is achieved.

The term “anchoring” means registering values returned by a predetermined function (hereinafter referred to also as “anchoring function”) operated for examining predetermined ranges of content whose location is in correlation with anchors. The function and the ranges of data are selected such that the returned values can be used as signatures identifying the content with a satisfactory probability.

SUMMARY OF THE INVENTION

The present invention relates to a method and systems for synchronizing between anonymous contents of data streams currently passing through a communication server and similar contents that have already passed through said server and been stored locally, such that transportation of certain amounts of said streaming data may be eliminated.

According to some embodiments the synchronization is also between several copies of similar data content passing through the server simultaneously. A system of the present invention comprises at least one server capable of reducing volumes of network transportation in-line as a result of self-initiated procedures which require no information concerning the source of the data, its type, its name, or any other of its identification details, in order to achieve volume reduction. This is in contrast to methods according to which the expected volume reduction depends on file names or routes in order to allow synchronization between a requested file and a locally stored file. Furthermore, when a file to be transferred is known, it also has known starting and ending points that can be used as reference locations for a digital signature to be computed over a predetermined range of data. In the case of streams of anonymous data whose content has no starting or ending points agreed upon by all parties (as is the case according to the present invention), there are no agreed reference points from which to calculate a repeatable digital signature. If a digital signature cannot be repeated, it cannot be used to identify the content. As will be further explained, the server according to the present invention comprises an anchor determination unit which solves this problem.

For further use of certain portions of the data streams passing through it, the server according to the present invention (hereinafter referred to also as “anonymous data caching server” or “ADC server”) stores such portions of the data without being aware of file names, wholeness of data, URLs, file types, or data origin or destination. According to the present invention only pure data is stored by the server, with no ID tags received from external file requestors or file providers.

A system according to the present invention comprises a communication server having a communication circuit for receiving and delivering streams of data, and at least one memory medium accessible thereto.

An anchor determination unit is provided in the server, capable of determining locations in the data streams where predefined groups of characters from the stream fulfill a predetermined criterion; the locations of such groups are determined as anchors (referred to also as “reference points”).

The server further comprises:

an anchoring-function unit for returning values (hereinafter referred to also as “digital signatures”) as a function of the content of ranges of data in the stream, wherein the ranges are in a known correlation with respect to the anchors (said values can then be of help when searching for the content);

a data-partitioning unit for dividing the stream of data into data blocks;

a registration unit for storing lists of values returned from the anchoring-function unit, wherein each value is associated by an appropriate reference with a specific data block containing the range of data from which said value was returned by the anchoring function (each of the values associated with a block will be referred to in the context of the present invention as “a block ID”, and in some particular cases as “a hash key”).

According to various embodiments of the present invention a plurality of block IDs are associated with each data block under normal circumstances.

According to one embodiment of the present invention the block partitioning unit divides the data into blocks of a predetermined size. For example, the predetermined size may be chosen to be 64K of data.

According to another embodiment, and in order to increase compatibility between blocks containing similar data on different servers and to avoid different partitioning of the data on each transmission of it, the partitioning unit determines the starting location of blocks as a function of data contained in the data stream. According to this variation the block partitioning unit applies to the data streams a function for determining anchors, wherein the function is designed to have a probability of returning one anchor per data range of a satisfactory size, e.g. a function according to which an anchor can be expected once per about 50K of data.

The data blocks may be saved to a caching memory for later retrieval according to block IDs which are associated with such blocks by appropriate references.

As mentioned above, an anchor is a location in a stream of data determined according to content thereof fulfilling a predetermined criterion. The criterion is set such that a satisfactory probability for the presence of an anchor over a predetermined amount of data is achieved.

For example, according to one embodiment of the invention a set of anchors associated with a data block may be the set of locations where a short string appears in the block, e.g. every place where “aaa” appears in the block. According to another embodiment the set of anchors may be the set of locations of n-tuples where a hash function over the n-tuple returns a predefined value, e.g. every location of a triplet of bytes in the block whose returned hash value is 123.
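The two anchor definitions above can be illustrated with a minimal Python sketch. It is not part of the disclosed embodiments: the function names, the use of CRC-32 as the hash, and the reduction of the hash value to 512 buckets are assumptions made only for illustration.

```python
import zlib
from typing import List


def string_anchors(block: bytes, marker: bytes = b"aaa") -> List[int]:
    """Return every offset in `block` where `marker` starts (overlaps included)."""
    offsets = []
    pos = block.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = block.find(marker, pos + 1)
    return offsets


def hash_anchors(block: bytes, n: int = 3, target: int = 123,
                 buckets: int = 512) -> List[int]:
    """Return offsets of n-tuples whose hash, reduced modulo `buckets`, equals `target`.

    CRC-32 and the bucket count are illustrative assumptions.
    """
    return [i for i in range(len(block) - n + 1)
            if zlib.crc32(block[i:i + n]) % buckets == target]
```

Because both functions look only at the few bytes found at each location, prepending or removing data ahead of a block shifts the reported offsets but does not change which byte sequences are marked as anchors.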

Some examples of hash functions that may be used to find an anchor are LFSR (aka CRC), DES, MD5, etc.

According to one preferred embodiment the anchoring function is designed such that an anchor is found once every 500 bytes on average. Accordingly, three anchors are expected to be found in every data packet. In case the data is divided into data blocks of 64K each, 128 anchors are expected per block. In case blocks of about 50K are used, the expected number of anchors is reduced accordingly, and thus about 100 anchors are expected per block.
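The expected counts quoted above follow from dividing the amount of data by the average anchor spacing. The short sketch below reproduces the arithmetic, assuming that “K” is read as 1,000 bytes and that a data packet carries about 1,500 bytes; neither figure is stated explicitly in the text.

```python
AVERAGE_ANCHOR_SPACING = 500  # bytes between anchors, on average (from the text)

PACKET_SIZE = 1_500  # assumption: packet size is not stated in the text
BLOCK_64K = 64_000   # assumption: "64K" read as 64,000 bytes
BLOCK_50K = 50_000   # assumption: "50K" read as 50,000 bytes


def expected_anchors(size_in_bytes: int) -> float:
    """Expected number of anchors in a range of the given size."""
    return size_in_bytes / AVERAGE_ANCHOR_SPACING


for label, size in [("packet", PACKET_SIZE),
                    ("64K block", BLOCK_64K),
                    ("50K block", BLOCK_50K)]:
    print(label, expected_anchors(size))
```

Reading 64K as 65,536 bytes instead would give about 131 anchors per block.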

As mentioned above, anchoring is registering values returned by a predetermined function (“anchoring function”) operated for examining predetermined ranges of content whose location is in correlation with anchors. The function and the ranges of data are selected such that the returned values can be used as signatures identifying the content with a satisfactory probability. According to one preferred embodiment of the present invention the anchoring function is chosen to be a hash function taken on 100 consecutive bytes starting at an anchor and returning a 96-bit hash value as a digital signature.
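A minimal sketch of such an anchoring function is given below. The text fixes only the range (100 bytes from the anchor) and the signature width (96 bits); the choice of SHA-1 truncated to 12 bytes and the name `signature_at` are assumptions made for illustration.

```python
import hashlib
from typing import Optional

RANGE_LEN = 100  # bytes hashed, starting at the anchor (from the text)
SIG_BYTES = 12   # 96 bits (from the text)


def signature_at(stream: bytes, anchor: int) -> Optional[bytes]:
    """Return a 96-bit signature of the 100 bytes starting at `anchor`.

    Returns None when fewer than 100 bytes are available after the anchor.
    Truncated SHA-1 is an illustrative assumption, not the disclosed function.
    """
    window = stream[anchor:anchor + RANGE_LEN]
    if len(window) < RANGE_LEN:
        return None
    return hashlib.sha1(window).digest()[:SIG_BYTES]
```

Since the signature depends only on the bytes that follow the anchor, the same content yields the same signature wherever it appears in a stream.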

According to one preferred embodiment of the present invention the registration unit (or any other appropriate unit of the server) is further designed to register the anchors in correlation with the registration of the block IDs. For example, after a given data block has been partitioned from the data stream by the data-partitioning unit, the locations of the anchors are registered, e.g. as offset references measured from the starting point of the block. By such registration, when a digital signature value is given, an identical value can be searched for in a list of block IDs, and if located, the block with which it is associated may be retrieved or accessed. The data range from which the digital signature value has been returned may thus be easily located according to the location of the anchor associated with this value, measured from the block's starting point.
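The registration of block IDs together with anchor offsets can be sketched as a lookup table. The class name `Registry`, the in-memory dictionary, and the truncated SHA-1 signature are assumptions for illustration only.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


def sig(data: bytes) -> bytes:
    """96-bit signature (truncated SHA-1, an illustrative assumption)."""
    return hashlib.sha1(data).digest()[:12]


class Registry:
    """Maps each block ID to (block number, anchor offset inside the block)."""

    def __init__(self) -> None:
        self.blocks: List[bytes] = []
        self.index: Dict[bytes, Tuple[int, int]] = {}

    def register(self, block: bytes, anchor_offsets: List[int],
                 span: int = 100) -> int:
        """Store `block` and one block ID per anchor that has `span` bytes after it."""
        block_no = len(self.blocks)
        self.blocks.append(block)
        for offset in anchor_offsets:
            window = block[offset:offset + span]
            if len(window) == span:
                # Keep the first registration if the same ID was seen before.
                self.index.setdefault(sig(window), (block_no, offset))
        return block_no

    def locate(self, signature: bytes) -> Optional[Tuple[int, int]]:
        """Return (block number, offset) for a signature, or None if unknown."""
        return self.index.get(signature)
```

Given a signature computed from incoming data, `locate` returns both the stored block and the offset at which the matching range begins, which is the information needed to line up the incoming packet with the stored content.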

According to another embodiment, the anchor locations in the block are not registered. According to this embodiment, when a digital signature value is given, an identical value can be searched for in a list of block IDs, and if located, the block with which it is associated may be retrieved or accessed. Since anchor locations are not registered according to this particular embodiment, the location in the block of the data range from which the digital signature value has been returned is still unknown. Thus, according to this embodiment a dictionary is to be used, as will be explained in the detailed description chapter. For this purpose, the ADC server may further comprise a dictionary generator unit.

Either through said anchor registration embodiment or through a dictionary generator embodiment, a currently received data packet from which a digital signature has been returned by the anchoring-function unit can be synchronized with a similar data packet contained in a data block already stored in the memory, such that they may easily be compared in order to verify whether they are identical. In case they are identical, the currently received packet can be replaced by a reference to its location in the block, thus reducing the volume of data sent to the receiving end. If they are not identical, the volume of data to be sent can still be reduced by replacing identical sub-strings (if any exist) with references to their starting and ending locations in the block.

The server of the present invention can be used in one of two basic options, or combinations thereof, in order to reduce transportation volumes over communication networks:

(i) in conjunction with at least one corresponding server of a similar type, both servers located at remote ends of a virtual communication line connecting them (in the context of the present invention the term communication line includes a wireless connection, as one of the possible options); and

(ii) in conjunction with at least one data provider computer (hereinafter “data provider”) comprising a redirection unit.

For the purpose of facilitating the description it will be assumed that two anonymous data caching servers according to the present invention are connected on two remote ends of a communication line, and are both working also in conjunction with data providers connected to each through respective networks and routers.

As a starting point, let us assume the two ADC servers have cache memories that are empty. Data streams are directed through the ADC servers. Each ADC server processes the data passing through it for determining anchors, for calculating digital signatures, for partitioning the data into blocks, and for creating block IDs. Under normal circumstances, the blocks are stored in the caches, which thus start to be filled with blocks of pure data containing no identification information. The block IDs of each block are also stored, such that the address of each block stored in the cache is linked with a respective list of block IDs. Simultaneously with the building of the cache in each ADC server, the blocks are sent packet by packet in the intended direction, that is, according to the circumstances, either to the corresponding ADC server at the opposite end of the communication line, or to conventional receivers to which the data is routed. Whenever a digital signature value is returned from a received data stream, the stored lists of block IDs in the ADC server are searched for an identical value. Upon recognition of a block ID having such an identical value, the block associated with this block ID is accessed. The server now locates the specific point in that block from which the block ID was returned by the anchoring function, and synchronizes between the location of the currently received data packet and the location in the block of the originally received data packet, e.g. by using the known offset of the anchor from the starting point of the block (according to the embodiment wherein the anchors are saved with the block IDs), or e.g. through the use of a dictionary (according to another embodiment).

The currently received packet and the original one can now be compared to verify whether they are identical, or to determine what portions of them (or what large substrings of them) are identical.

In case they are identical (and, according to various other embodiments, in case certain portions of them are identical), the ADC server initiates a procedure for reducing the volume of the data stream, which is now expected to contain data that is already stored locally.

If the source of the data is a data provider, the initiating ADC server sends a message to that data provider to activate the anchoring function and to send only references to locations of anchors in the block, instead of sending the data itself. As long as the references received allow the server to retrieve whole packets from the locally stored block, the data provider keeps sending references instead of full data. If at any given point the server recognizes mismatches between the signatures of the received data and those of the locally stored data, it sends a message to the provider to return to conventional transmission.

In case the received data streams are to be forwarded to a corresponding ADC server, the initiating server sends a message to the corresponding server to retrieve from its cache a block according to a currently recognized signature. One or several more currently received packets are then sent to the corresponding server in conventional transmission mode, until the corresponding server confirms that the intended block was recognized and fetched.

In the initiating server, portions of the currently received packets which are identical to locally stored ones are then replaced by references to the block, allowing the corresponding server to reconstruct them from the block. Since the references are shorter than the data they replace, the packet size is reduced. Reduced-size packets are thus sent to the corresponding server as long as data identical to the currently received data can be retrieved locally.
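The replacement of a packet by a reference, and its reconstruction at the corresponding server, can be sketched as follows. The reference layout (block number, start offset, length) and the function names are assumptions for illustration; the text does not specify a wire format.

```python
from typing import List, Tuple, Union

Ref = Tuple[int, int, int]   # (block number, start offset, length): assumed layout
Item = Union[bytes, Ref]     # what travels on the line: raw data or a reference


def encode(packet: bytes, blocks: List[bytes], block_no: int, start: int) -> Item:
    """Replace `packet` by a reference if it equals the aligned range of the stored block."""
    if blocks[block_no][start:start + len(packet)] == packet:
        return (block_no, start, len(packet))
    return packet  # mismatch: fall back to conventional transmission


def decode(item: Item, blocks: List[bytes]) -> bytes:
    """Rebuild the original packet on the receiving side."""
    if isinstance(item, bytes):
        return item
    block_no, start, length = item
    return blocks[block_no][start:start + length]
```

Both servers must hold the same block for the reference to be meaningful, which is why the text has the corresponding server confirm that the block was fetched before references are sent.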

According to one preferred embodiment, the ADC server is further used for eliminating unnecessary transportation volume in cases where identical information flows simultaneously to different destinations through the server. According to this embodiment the initiating server searches for anchors in currently received packets and sometimes may recognize that several locally stored packets return identical signatures. This may be the case, for example, when a live video is being broadcast to a plurality of receivers. In such a case, the initiating server can send the corresponding server a message to use a single packet being sent and to distribute it to all the receivers for which identical packets are waiting in the initiating server. The volume of transportation in such a case can be meaningfully reduced, and the downloading time of large amounts of data can be meaningfully shortened.

Some basic embodiments of the invention will now be described in brief:

The present invention relates to a communication server for delivering data streams to a remote destination over a communication network, the server comprising a replacement unit for replacing pieces of data from intended incoming data streams to be received from a remote sender by identical data pieces retrievable from a data storage accessible thereto, according to references supplied by the remote sender; characterized by an identification unit for identifying the pieces of data to be replaced according to a digital signature that is a function of data contained in said pieces, and by an anchor-determination unit for determining locations in the data streams where predefined groups of characters from the stream fulfill a predetermined criterion, the locations of such groups being reference points to the digital signatures.

According to various preferred embodiments the communication server further comprises a messaging unit for notifying a remote sender to stop delivering intended incoming pieces of data which are retrievable from a data storage accessible thereto.

The remote sender can be a PC delivering data (or files that the serverof the invention refers to as plain data).

According to various preferred embodiments the server units are designed to process pieces of data being packets of the TCP/IP transmission protocol.

According to various preferred embodiments the server further comprises a data storage accessible thereto, wherein the packets are stored in the data storage in blocks of variable size, which is determined according to anchor location in the original data stream.

According to various embodiments the digital signature returned by the anchoring-function unit is based on any of a CRC, SHA1 or DES value computed over a predetermined number of bytes from a selected piece of data.

According to some preferred embodiments the digital signature is calculated from a predetermined number of bytes of data, the location of said bytes in the stream of data is in correlation with at least one anchor, and the anchor is a pointer to a location in the stream of data having a compatibility with a predetermined criterion.

According to various preferred embodiments said criterion is a function of data contained in said pieces of data and is independent of a title, address or routing information of said data.

The function is responsive to a predetermined character combination such that an anchor is assigned upon recognition of said character combination.

According to various embodiments the character combination is a short string of predefined characters.

According to various embodiments a set of anchors is assigned to a piece of data, each anchor from the set being in correlation with an n-tuple location in said piece of data, wherein the function is a hash function yielding a predefined value over the n-tuple.

According to various preferred embodiments the hash function is selected from the group containing LFSR, CRC, SHA1, DES, and MD5.

The server and systems containing it may treat files delivered through P2P communication no matter what their size is or whether they are divided and downloaded from a plurality of providers, since the present invention treats any type of file as a plain anonymous data stream.

The present invention also relates to a method for delivering data streams over communication networks, the method comprising determining reference points in a stream of data, being locations in the stream where a predefined number of characters fulfill a predetermined criterion; registering digital signatures, being values returned from a predetermined function taken over predefined ranges of content, the ranges being in correlation with the reference points; using the digital signatures to locate locally stored content; and using the reference points, or creating a dictionary and using it, for synchronizing between currently received pieces of data and locally stored matching content.

The present invention further relates to a computer readable medium containing instructions for controlling a computer system to implement the method.

A system for reducing transportation volumes over communication networks, comprising at least one communication server as defined by the description above and hereinafter, is also within the scope of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to understand the invention and to see how it may be carried out in practice, a preferred embodiment will now be described, by way of non-limiting example only, with reference to the accompanying drawings, in which:

FIG. 1 illustrates the relation between a data stream, packets contained in the data stream, data blocks partitioned from the data stream, and anchors generated according to the present invention in order to allow effective correlation between repeating instances of similar anonymous data streams.

FIG. 2 illustrates the relationship between blocks of data, an array of block IDs that allows locally stored blocks to be located, and a dictionary which according to some embodiments allows packets to be located in a block.

FIG. 3 illustrates a first part of a flow chart demonstrating a process for reducing transportation volume over a network line according to one preferred embodiment of the present invention.

FIG. 4 illustrates a second part of the flow chart of FIG. 3.

FIG. 5 illustrates an example of a system configuration according to the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The invention relates to a method for reducing bandwidth by packet caching, and a system for reducing bandwidth by packet caching. The core of the invention lies in storing packets and retrieving them quickly and efficiently. It will now be described how packets can be stored so that they can later be retrieved efficiently.

Throughout the following description a file being transferred shall be referred to as a stream of data. This is a reasonable view, since devices dealing with real-time streams in the net do not know in advance the file that is being transferred; they learn the file as it passes through them, just as a stream.

A communication server according to the present invention learns the data as it reads packets belonging to the stream from the communication line. The stream being learned is to be partitioned into data blocks. The size of a block is independent of the packet size. Hereafter a block size of 64K will be referred to as an example (although it can be of any acceptable size whatsoever), according to one embodiment. A block also need not start at the beginning of a packet. As packets of a stream are read from the net, their data is copied into a data block. When a block is full and contains 64K of data, it is written to the local disk after some preprocessing, to be discussed below. According to another embodiment the ending position of a block and the beginning of the next block in a stream are determined by anchors (in a way that will be explained later). Through the use of anchors (in contrast to the use of fixed-size blocks according to the other embodiment) block partitioning becomes dependent on the content (since the anchor is determined as a function of the contained data), and thus blocks containing identical data are more likely to be found identically partitioned across the network, while the partitioning of fixed-size blocks may occasionally change, since no inherent rule determines their partitioning.

According to various embodiments hash functions are used to facilitate locating required data during the processes carried out according to the present invention. A hash key is defined to be a number of n bits that depends on the value of a range of data whose size is m bytes, where the probability of having two identical hash keys for two different m-byte values is very low. Hash keys can be created by computing the CRC value of m bytes, or by calculating their SHA1 value, DES value, or any other function known to satisfy the above condition. The decision about the specific values of n and m can be made by those skilled in the art, depending on the network and the packet type to which the method is applied. Hash keys may be used for locating a required block on the disk. Hash keys may also be used for locating a specific required packet in the block.

According to one preferred embodiment of the present invention a 64-bit hash key taken on 100 bytes is used for locating a block on the disk.

According to another preferred embodiment of the present invention a 96-bit hash key taken on 100 bytes is used for locating a block on the disk.

According to some embodiments a 16-bit hash key taken on 5 bytes is used for creating a dictionary that allows a requested packet to be found in a block.

According to various embodiments of the present invention anchors are preferably selected to be dependent on only small amounts of data (e.g. a few bytes), and independent of the starting position of the block that contains the anchors or of the starting position of the packets containing them. One example of defining anchors on a stream is choosing an anchor to be every position in the stream where the string “abc” appears. Another example is choosing anchors to be every position in the stream where a 9-bit hash of 5 consecutive bytes is zero. A 9-bit CRC was chosen because, when the CRC of a five-byte string is given, it is easy to remove the contribution of the first byte in the string and to add a new byte at the end of the string. Thus the CRC can be “rolled” over the buffer efficiently.
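The “rolling” idea can be sketched as follows. For brevity the sketch uses a polynomial rolling hash rather than the 9-bit CRC named in the text, and keeps only the low 9 bits of the result; the constants `BASE` and `MOD` and the function names are assumptions for illustration.

```python
from typing import List

WINDOW = 5             # consecutive bytes hashed (from the text)
BASE = 257             # assumption: multiplier of the polynomial hash
MOD = (1 << 31) - 1    # assumption: prime modulus


def window_hash(window: bytes) -> int:
    """Hash of one window, computed from scratch."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h


def rolling_anchors(stream: bytes) -> List[int]:
    """Positions where the low 9 bits of the 5-byte window hash are zero."""
    if len(stream) < WINDOW:
        return []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = window_hash(stream[:WINDOW])
    anchors = []
    for start in range(len(stream) - WINDOW + 1):
        if start > 0:
            # Remove the first byte's contribution, then append the new last byte.
            h = ((h - stream[start - 1] * top) * BASE + stream[start + WINDOW - 1]) % MOD
        if h & 0x1FF == 0:
            anchors.append(start)
    return anchors
```

Each step costs a constant amount of work regardless of the window length, and with a 9-bit value an anchor is expected roughly once every 512 positions on random data.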

At every place where an anchor is found, a 96-bit hash key is computed over the next 100 consecutive bytes. The value of the hash key returned is called a “block ID”. Obviously, according to the described procedure a block will have a plurality of IDs. In order to prevent too many block IDs, it is possible to skip a certain amount of data after finding an anchor, e.g. to skip 400 or 500 bytes past the previously considered anchor before looking for the next anchor. Accordingly, it is appreciated that a packet will hold no more than three block IDs, and that a block of 64K will hold no more than 128 block IDs.
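The skipping rule and the computation of block IDs can be sketched as below. A minimum gap of 500 bytes, truncated SHA-1 as the 96-bit key, and the function names are assumptions for illustration; the anchor positions are taken as given.

```python
import hashlib
from typing import List, Tuple


def select_anchors(anchors: List[int], min_gap: int = 500) -> List[int]:
    """Keep an anchor only if it lies at least `min_gap` bytes after the last kept one."""
    chosen: List[int] = []
    for position in sorted(anchors):
        if not chosen or position - chosen[-1] >= min_gap:
            chosen.append(position)
    return chosen


def block_ids(block: bytes, anchors: List[int], min_gap: int = 500,
              span: int = 100) -> List[Tuple[int, bytes]]:
    """Return (anchor offset, 96-bit block ID) pairs for the kept anchors."""
    ids = []
    for position in select_anchors(anchors, min_gap):
        window = block[position:position + span]
        if len(window) == span:  # skip anchors too close to the end of the block
            ids.append((position, hashlib.sha1(window).digest()[:12]))
    return ids
```

With a 500-byte gap, even if every position were an anchor, a 1,500-byte range keeps three anchors and a 64,000-byte range keeps 128, which matches the bounds stated above when “64K” is read as 64,000 bytes.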

All the block IDs are then saved in an array on the disk. This array will be referred to also as the “hash array”. Every block ID is associated with one entry of the hash array, although many IDs might be mapped to the same entry, as they all refer to one specific block. At each entry a list of block IDs is thus kept, together with the location of their associated block.

According to some embodiments, hash keys are computed for every block, on every m consecutive bytes of the block, and every hash key is stored in an array together with the position where it was generated. This array will be referred to also as the “dictionary” of the block, and it will be used according to these embodiments for locating required packets in the block. According to one embodiment, and as mentioned earlier, the hash key is chosen to be 16 bits long, and it is calculated over 5 consecutive bytes. The values of the hash keys returned from the calculation are stored in an array; however, according to other embodiment variations they can be stored in a list, a tree or any structure that allows efficient searching. The dictionary thus is set to be an array of 65536 entries (wherein each entry corresponds to one different possible combination of the 16-bit key). In case a hash key h was calculated at position p, the h-th entry of the array will be set to hold the number p. Accordingly, in order to find the position in a packet where a hash key h was computed, the value stored in the h-th entry in the array should be read.
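The dictionary lookup can be sketched as below. The 16-bit key function is a hypothetical stand-in (the text does not specify one), and the value -1 is used here to mark an empty entry. When two positions yield the same key, the later one overwrites the earlier one, so the bytes are compared again after the lookup.

```python
M = 5                      # bytes covered by each key
EMPTY = -1                 # marker for an unused entry (assumption)


def key16(chunk: bytes) -> int:
    """Hypothetical 16-bit hash of a short byte string."""
    h = 0
    for b in chunk:
        h = (h * 31 + b) & 0xFFFF
    return h


def build_dictionary(block: bytes) -> list[int]:
    """Array of 65536 entries; entry h holds a position p where key h was computed."""
    table = [EMPTY] * 65536
    for p in range(len(block) - M + 1):
        table[key16(block[p:p + M])] = p
    return table


def locate(table: list[int], block: bytes, probe: bytes) -> int:
    """Return a position in the block whose M bytes equal the probe, or -1."""
    p = table[key16(probe[:M])]
    if p != EMPTY and block[p:p + M] == probe[:M]:
        return p
    return -1
```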

The dictionary size can be reduced by computing a hash key only for every m consecutive bytes whose starting position inside the packet can be divided by x, where x is a parameter that can be chosen by the developer. A higher value of x will result in a smaller dictionary size. For example, x may be chosen to be 16.
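The effect of the parameter x can be illustrated with a short sketch. A Python dict is used instead of a fixed array so that the number of stored keys is visible; the key function and the helper names are assumptions of the example.

```python
def key16(chunk: bytes) -> int:
    """Hypothetical 16-bit hash of a short byte string."""
    h = 0
    for b in chunk:
        h = (h * 31 + b) & 0xFFFF
    return h


def sampled_positions(block_len: int, m: int = 5, x: int = 16) -> range:
    """Start positions divisible by x at which an m-byte key is computed."""
    return range(0, block_len - m + 1, x)


def build_sparse_dictionary(block: bytes, m: int = 5, x: int = 16) -> dict[int, int]:
    """Map key -> position, hashing only at positions divisible by x."""
    return {key16(block[p:p + m]): p for p in sampled_positions(len(block), m, x)}
```

For a 64K block and m = 5, hashing at every position computes 65532 keys, whereas x = 16 computes 4096. One consequence, noted here as an observation rather than as part of the described method, is that the searching side may then need to try every offset of the incoming data, since its alignment relative to the stored block is not known in advance.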

Reference is now made to FIG. 1, which illustrates a data stream, represented by the frame 1, and its relation to packets, blocks and anchors. The dots inside the frame represent the data content. Packets are contained in the data stream, as represented by the frame 2, which is identical to the frame 1, with the difference that frame 2 illustrates the starting and the ending points of the packets by vertical lines 4. Data blocks to be partitioned from the data stream are represented by vertical double lines 5, and anchors 6, 7 and 8, generated according to the present invention in order to allow effective correlation between repeating instances of similar anonymous data streams, are also illustrated. In the illustrated example, the anchors are defined as a location where the combination of characters ‘abc’ is found in the data stream. Accordingly, the instances of such combinations in the stream were highlighted by explicitly typing said character combination. The vertical dotted lines 9, 10 and 11 pass through the stream, packet and block representations in order to emphasize that anchors provide inherent reference points to locations in the data stream, such that no matter how this stream is partitioned, the reference points can always be recognized by activating a function that returns an anchor whenever the predefined character combination is detected.

Referring to FIG. 2, three blocks stored in the cache are represented by respective three frames marked A, B, and C. The first block, A, contains the string “abcdeafchijk”. The dictionary of the first block is represented by a frame marked D. The dictionary D indicates the locations in the block A where triplets of characters appear. The Fig. further illustrates an array of block IDs (marked E), wherein in the illustrated example two of the IDs are associated with, and thus point to, the first block A, as represented by respective arrows 31 and 32. Upon receiving a packet, it is searched for an anchor, and when one is found, a digital signature is computed by a hash function returning a hash value from the 100 bytes following the anchor. The digital signature value is then searched for in an array storing block IDs. This array also stores the location in the cache in which the block is stored. In case a match occurs between the digital signature value and a value of any one of the block IDs, the block associated with the matching block ID is fetched from the cache. After fetching the block into memory, the dictionary can be used to find large substrings of the packet in the block which are identical to corresponding substrings in the currently received packet. Such substrings can then be deleted from the packet and replaced by references to the block.

An ADC server that receives such a processed packet may retrieve said deleted parts of the packet from its local cache, and thus the volume of the transmitted data is reduced in accordance with the volume of the deleted substrings.

The process executed according to various embodiments of the present invention will be further explained assuming a configuration wherein two ADC servers are connected respectively on opposite ends of a communication line, and assuming (for simplicity of explanation) that all communications are transmitted from the same end of the line (in the context of the present invention, the “initiator end”) and received at the other end (in the context of the present invention, the “receiving end”), and that the servers at both ends of the line have run for a sufficient amount of time, have studied the information transmitted over the line, and have built the data structures explained above.

In brief, this configuration is regarded according to the present invention as a system comprising two ADC servers, one at each end of a communication line. The communication transmitted over the line passes through both servers. The two servers study the files and streams that are transmitted over the line. They partition them into blocks and store the blocks on their local disk together with a dictionary (according to one variation) or with anchor references (according to another variation). They also update their hash file containing the block IDs, according to newly stored blocks. When a packet of a stream is transferred, the two computers search their disks, using their hash file, and fetch a block that was stored previously. This block is used by the transmitting computer to replace data in the packet with references to data inside the block, and by the receiving computer to reconstruct the packet according to the references. Said process will now be described in more detail.

A packet read from the network at the initiator end is a part of a stream of communication. This stream of communication is distinguished from other communications by its communication ID, which is the four-tuple: source IP address, destination IP address, source port and destination port. Upon reading the packet and before it is transmitted to the receiving end, the initiator server goes over the packet to find anchors in this packet. The expected number of anchors in a packet is three (assuming the aforementioned embodiment directed to such a number of anchors is being used). This means that there is a certain probability that some anchor will be found. Notice that the position of the anchor in the stream is a function of the packet content rather than its position in the packet. This guarantees that the anchor found in a currently received packet corresponds to an anchor that was previously found when the stream was learned, if indeed the two packets contain identical portions of information.
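As a small illustration of the communication ID, the four-tuple can serve directly as a key under which the packets of one stream are collected. The type, field and function names below are hypothetical.

```python
from typing import NamedTuple


class CommunicationID(NamedTuple):
    """The four-tuple that distinguishes one stream from another."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


# Payload bytes gathered so far for each stream, keyed by communication ID.
streams: dict[CommunicationID, bytearray] = {}


def append_packet(cid: CommunicationID, payload: bytes) -> None:
    """Append a packet payload to the stream it belongs to."""
    streams.setdefault(cid, bytearray()).extend(payload)
```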

After finding one such anchor, the digital signature value defined at that anchor is computed. The hash array is used to search for a block ID that matches the digital signature returned from said calculation. In case a match is found, the block associated with the matching block ID is fetched from the cache. Meanwhile the packet is transmitted over the line to the receiving end, following a message that tells it to fetch the same block from its disk. Since said block has already passed in the past through the initiator end (either to or from the receiving end), it is expected that under normal conditions it will be found also on the receiving end. It takes the disk of the receiving end a few milliseconds to fetch the block. During this time, more packets of the same stream may be transmitted unchanged (i.e. through conventional transmission mode) over the line. The number of these packets is not expected to be greater than a dozen.

After the block has been fetched from the disk of the receiving end, and when a packet arrives from the same stream, the position of the packet inside the stream may be determined using the dictionary. For this purpose a hash key on five bytes inside the packet is computed and the value h is returned. The h-th entry of the dictionary, which holds the position where a string that generated the same hash has appeared in the block, is then read, and the data in that location in the previously stored packet is compared with the data in the currently received packet to see if they match. If they do, the data in the packet is replaced with an indication that the data appears in the block, together with its position in the block and its length. Said procedure is repeated as many times as needed until going over the entire packet. The server at the receiving end of the line reconstructs the packet by copying the indicated data from the block into the packet.
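The replacement and reconstruction steps can be sketched as a pair of functions. The token format ("lit" for literal bytes, "ref" for a position and length in the block), the key function and the use of a Python dict for the dictionary are all assumptions of this example, not details given in the text.

```python
M = 5  # bytes covered by each dictionary key


def key16(chunk: bytes) -> int:
    """Hypothetical 16-bit hash of a short byte string."""
    h = 0
    for b in chunk:
        h = (h * 31 + b) & 0xFFFF
    return h


def build_dictionary(block: bytes) -> dict[int, int]:
    """Map each key to the last position in the block that produced it."""
    return {key16(block[p:p + M]): p for p in range(len(block) - M + 1)}


def encode(packet: bytes, block: bytes, table: dict[int, int]) -> list[tuple]:
    """Replace runs of the packet that also appear in the block with references."""
    tokens, literal, i = [], bytearray(), 0
    while i < len(packet):
        p = table.get(key16(packet[i:i + M])) if i + M <= len(packet) else None
        if p is not None and block[p:p + M] == packet[i:i + M]:
            n = M
            # Extend the match for as long as packet and block agree.
            while i + n < len(packet) and p + n < len(block) and packet[i + n] == block[p + n]:
                n += 1
            if literal:
                tokens.append(("lit", bytes(literal)))
                literal.clear()
            tokens.append(("ref", p, n))
            i += n
        else:
            literal.append(packet[i])
            i += 1
    if literal:
        tokens.append(("lit", bytes(literal)))
    return tokens


def decode(tokens: list[tuple], block: bytes) -> bytes:
    """Rebuild the packet by copying the referenced data from the block."""
    out = bytearray()
    for token in tokens:
        if token[0] == "ref":
            out += block[token[1]:token[1] + token[2]]
        else:
            out += token[1]
    return bytes(out)
```

Both ends must hold the same block for the references to resolve to the same bytes, which is why the receiving end is told which block to fetch before references are sent.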

In order to improve the fetching time of blocks from the cache, prefetching techniques may be applied. For example, a block (that may later be recognized as a required one) may be prefetched before it is actually needed by identifying that the stream has reached the end of the current block, and then prefetching a set of blocks of which one is predicted to be needed next. For this purpose, a list of blocks that may be needed after a specific block is used may be learned for every block, e.g. by means of self-learning techniques.
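One simple way to learn which blocks tend to follow a given block is to count observed successions, as in the sketch below. The text leaves the learning technique open, so this frequency count, the class name and the parameter k are assumptions made for illustration.

```python
from collections import Counter, defaultdict


class SuccessorModel:
    """Counts which block was needed after which, to suggest prefetch candidates."""

    def __init__(self):
        self.successors = defaultdict(Counter)

    def observe(self, previous_block: str, next_block: str) -> None:
        """Record that next_block was needed right after previous_block."""
        self.successors[previous_block][next_block] += 1

    def candidates(self, block: str, k: int = 2) -> list[str]:
        """Return up to k blocks most often seen after the given block."""
        counts = self.successors.get(block, Counter())
        return [name for name, _ in counts.most_common(k)]
```

When the stream reaches the end of the current block, the blocks returned by candidates could be fetched ahead of time.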

FIG. 3 illustrates a first part of a flow chart demonstrating a process for reducing transportation volume over a network line according to one preferred embodiment of the present invention. The first part of the flow chart illustrates in general how a first communication server ADC1 prepares for working in conjunction with a similar second communication server ADC2. ADC1 first reads a packet as illustrated by step 41, finds anchors in the packet, calculates digital signatures over predetermined data ranges whose location is in correlation to the anchors, and searches a list of block IDs trying to locate previously stored signatures that are identical to currently retrieved ones, as represented by steps 42 and 43 of the process. In case a match was found, the process proceeds to step 44, by loading a block from a local cache according to the location of the block which is associated with the block ID that was found as matching a signature of the currently received packet. The data of the currently received packet is then compared with the data in the block (after synchronizing the respective packets, e.g. according to the anchors with which the signature is associated). In case step 45 is accomplished positively and identical data was found, the first server sends a message (as represented by step 46) to the second server ADC2, to fetch a block from its local cache according to the now known matching signature. The first server waits for a confirmation from the second server, and stays in non-sync mode of operation as long as a confirmation of fetching the appropriate block has not been returned from the second server, as represented by steps 47, 48, and 50, wherein a non-sync mode means that packets continue to be sent to the second server conventionally (i.e. in their original form) as represented by step 54. The process is thus repeated until a confirmation from the second server has been received and the two servers start working in sync mode, as represented by step 49 and as further detailed in FIG. 4, or until the compared packets are found not identical. In the latter case the process continues from step 45 to step 53: the current packet, which is not yet known to the first server, is maintained in a buffer as represented by step 53, and is further sent conventionally to the second server as represented by step 54. The process is then repeated by reading another packet, until the buffer has been filled by a whole block of unfamiliar data, which is then stored locally as represented by step 51, together with its block IDs and associated anchor locations, allowing for a future retrieval of the block upon recognition of a matching signature in some data stream to be received in the future.

FIG. 4 illustrates a second part of the flow chart of FIG. 3, which is the sync mode wherein the two servers operate on familiar data (i.e. data which is already stored locally on the caches of them both). In the sync mode the first server reads a currently received packet as represented by step 55, and compares it to the data of the corresponding packet in the block that was loaded in step 44 (of FIG. 3). In case the data in the two is identical, the server then sends instructions to the second server to reconstruct the data from the block that was fetched by the second server as a response to the message that was sent to it in step 46 (of FIG. 3), and to send the reconstructed packet to its destination. In case the comparison taken in step 56 is negative, the first server sends a message to the second server that the sync mode is to be ceased, as represented by step 59, and further sends the currently received packet to the second server in its original form. The server then changes its mode of operation to non-sync as represented by step 61, and the process starts again, as represented by step 40 of FIG. 3.
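The alternation between non-sync and sync modes on the first server can be sketched as a small loop. This is a deliberately simplified model under stated assumptions: a single cached block is considered, a packet counts as familiar when it occurs verbatim in that block, the confirmation of the second server is modelled as arriving after a fixed number of packets (confirm_delay), and the buffering and storing of unfamiliar data (steps 51 and 53) is left out. The message names are hypothetical.

```python
def sender(packets: list[bytes], block: bytes, confirm_delay: int = 1) -> list[tuple]:
    """Return the messages the first server would send for the given packets."""
    out = []
    sync = False
    waiting = None  # packets left before the peer's confirmation arrives
    for pkt in packets:
        pos = block.find(pkt)
        if sync:
            if pos >= 0:
                # Familiar data: send a reference instead of the data itself.
                out.append(("ref", pos, len(pkt)))
                continue
            # Comparison failed: cease sync mode and send the packet as is.
            out.append(("stop_sync",))
            out.append(("raw", pkt))
            sync, waiting = False, None
            continue
        if pos >= 0 and waiting is None:
            # Match found: ask the peer to fetch the same block.
            out.append(("fetch_block",))
            waiting = confirm_delay
        # Non-sync mode: packets go out in their original form.
        out.append(("raw", pkt))
        if waiting is not None:
            if waiting == 0:
                sync, waiting = True, None  # confirmation received
            else:
                waiting -= 1
    return out
```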

FIG. 5 illustrates an example of a system configuration according to the present invention, comprising two ADC servers 61 and 62 connected on two ends of a virtual communication line 63 and further communicating through network lines 64 and 66, respectively, with conventional communication networks represented by routers 67 and 65 and by data providers 69 and 68 and data receivers 73 and 72.

By such a network configuration, and by establishing the communication between the two ADC servers over the virtual line, streams of data may be redirected such that the concerned data streams are assured to pass through both servers.

1-16. (canceled)
17. A communication server configured to deliver a data stream from a remote sender to a remote destination over a communication network, the communication server comprising: a data storage unit comprising a non-transitory computer readable medium accessible thereto; an anchor-determination unit configured to determine at least one anchor in an incoming data stream, the anchor being indicative of a location in the stream of data where a group of characters in the data stream fulfill a predetermined criterion; the anchor being a reference point indicative of a respective predetermined data range in said data stream for calculating a digital signature identifying a respective data block in said data stream; an identification unit configured to: calculate said digital signature over said predetermined data range in the data stream; and identify using the digital signature a previously stored data block comprising pieces of data that are substantially identical to pieces of data in said incoming data stream; a replacement unit configured to: replace references to the location of pieces of data in the incoming data stream with respective pieces of data in said previously stored data block.
18. The communication server of claim 17 wherein said data block comprises a plurality of packets; said data range for calculating said digital signature identifying said data block is located in a first packet of said plurality of packets; said replacement unit is configured to replace pieces of data in at least one other packet of said plurality of packets in said data block, said at least one other packet are received later than said first packet.
19. The communication server of claim 17 wherein the data range consists of a number of bytes which is independent of block size.
20. The communication server according to claim 17, wherein the pieces of data are packets of TCP/IP transmission protocol.
21. The communication server according to claim 17, wherein packets are stored in the data storage unit in blocks of variable size which is determined according to anchor location on the original data stream.
22. The communication server according to claim 17, wherein the digital signature is based on any of CRC, SHA1 or DES computed value of a predetermined number of bytes from a selected piece of data.
23. The communication server according to claim 17, wherein the digital signature is calculated from a predetermined number of bytes of data, the location of said bytes in the data stream is in correlation with at least one anchor, and the at least one anchor is a pointer to a location in the data stream having a compatibility with the predetermined criterion.
24. The communication server according to claim 22, wherein the predetermined criterion is a function of data contained in said pieces of data and is independent of a title, address or routing information of said data.
25. The communication server according to claim 23, wherein the function is responsive to a predetermined character combination such that an anchor is assigned upon recognition of said predetermined character combination.
26. The communication server according to claim 24, wherein the predetermined character combination is a string of predefined characters.
27. The communication server according to claim 24, wherein a set of anchors is assigned to a respective piece of data, each anchor from the set is in correlation to an n-tuple location in said respective piece of data, and wherein the function is a hash function yielding a predefined value over the n-tuple.
28. A method of delivering a data stream from a remote sender to a remote destination over a communication network, the method comprising: accessing a non-transitory computer readable media containing instructions for controlling a computer system for: determining at least one anchor in an incoming data stream, the anchor being indicative of a location in the stream of data where a group of characters in the data stream fulfill a predetermined criterion; the anchor being a reference point indicative of a respective predetermined data range in said data stream for calculating a digital signature identifying a respective data block in said data stream; calculating said digital signature over said predetermined data range in the data stream; and identify using the digital signature a previously stored data block comprising pieces of data that are substantially identical to pieces of data in said incoming data stream; replacing references to the location of pieces of data in the incoming data stream with respective pieces of data in said previously stored data block.
29. The method of claim 28 wherein said data block comprises a plurality of packets; said data range for calculating said digital signature identifying said data block is located in a first packet of said plurality of packets; the method comprising: replacing pieces of data in at least one other packet of said plurality of packets in said data block, said at least one other packet are received later than said first packet.
30. The method of claim 28 wherein the data range consists of a number of bytes which is independent of block size.
31. A system configured to reduce data transportation volumes over a communication network, comprising at least a first communication server being configured to deliver a data stream to a second server over a communication network, the first communication server comprising: a data storage unit comprising a non-transitory computer readable medium accessible thereto; an anchor-determination unit configured to determine at least one anchor in a data stream, the anchor being indicative of a location in the stream of data where a group of characters in the data stream fulfill a predetermined criterion; the anchor being a reference point indicative of a respective predetermined data range in said data stream for calculating a digital signature identifying a respective data block in said data stream; an identification unit configured to: calculate said digital signature over said predetermined data range in the data stream; and identify using the digital signature a previously stored data block that is substantially identical to the data block in the data stream; a replacement unit configured to: replace pieces of data in at least one packet in the data block in the data stream with references to the location of respective pieces of data in the previously stored data block, thereby generating a reconstructed packet in the data stream; and forward the reconstructed packet to be transmitted to the second communication server.
32. The system of claim 31 further comprising said second communication server, the second server being operable to replace said references to the location of respective pieces of data in the reconstructed data stream with respective pieces of data from a previously stored data block.
33. The system according to claim 31 wherein said data block comprises a plurality of packets; said data range for calculating said digital signature identifying said data block is located in a first packet of said plurality of packets; said replacement unit is configured to replace pieces of data in at least one other packet of said plurality of packets in said data block, said at least one other packet are received later than said first packet.
34. The system according to claim 31 wherein said data range consists of a number of bytes which is independent of block size.
35. A method of reducing data transportation volumes over a communication network comprising at least a first communication server being configured to deliver a data stream to a second server over a communication network, the method comprising: accessing a non-transitory computer readable media, at the first communication server, containing instructions for controlling a computer system for: determining at least one anchor in a data stream, the anchor being indicative of a location in the stream of data where a group of characters in the data stream fulfill a predetermined criterion; the anchor being a reference point indicative of a respective predetermined data range in said data stream for calculating a digital signature identifying a respective data block in said data stream; calculating said digital signature over said predetermined data range in the data stream; and identify using the digital signature a previously stored data block that is substantially identical to the data block in the data stream; replacing pieces of data in at least one packet in the data block in the data stream with references to the location of respective pieces of data in the previously stored data block, thereby generating a reconstructed packet in the data stream; and forwarding the reconstructed packet to be transmitted to the second communication server.
36. The method of claim 35 wherein said data block comprises a plurality of packets; said data range for calculating said digital signature identifying said data block is located in a first packet of said plurality of packets; said replacement unit is configured to replace pieces of data in at least one other packet of said plurality of packets in said data block, said at least one other packet are received later than said first packet.
37. The method of claim 35 wherein the data range consists of a number of bytes which is independent of block size.
38. A computer readable media containing instructions for controlling a computer system to implement a method of delivering a data stream from a remote sender to a remote destination over a communication network, the method comprising: accessing a non-transitory computer readable media containing instructions for controlling a computer system for: determining at least one anchor in an incoming data stream, the anchor being indicative of a location in the stream of data where a group of characters in the data stream fulfill a predetermined criterion; the anchor being a reference point indicative of a respective predetermined data range in said data stream for calculating a digital signature identifying a respective data block in said data stream; calculating said digital signature over said predetermined data range in the data stream; and identify using the digital signature a previously stored data block comprising pieces of data that are substantially identical to pieces of data in said incoming data stream; replacing references to the location of pieces of data in the incoming data stream with respective pieces of data in said previously stored data block.
39. A computer readable media containing instructions for controlling a computer system to implement a method of reducing data transportation volumes over a communication network comprising at least a first communication server being configured to deliver a data stream to a second server over a communication network, the method comprising: accessing a non-transitory computer readable media, at the first communication server, containing instructions for controlling a computer system for: determining at least one anchor in a data stream, the anchor being indicative of a location in the stream of data where a group of characters in the data stream fulfill a predetermined criterion; the anchor being a reference point indicative of a respective predetermined data range in said data stream for calculating a digital signature identifying a respective data block in said data stream; calculating said digital signature over said predetermined data range in the data stream; and identify using the digital signature a previously stored data block that is substantially identical to the data block in the data stream; replacing pieces of data in at least one packet in the data block in the data stream with references to the location of respective pieces of data in the previously stored data block, thereby generating a reconstructed packet in the data stream; and forwarding the reconstructed packet to be transmitted to the second communication server.